Long-term host–pathogen evolution of endogenous beta- and gammaretroviruses in mouse lemurs with little evidence of recent retroviral introgression

Abstract Madagascar’s flora and fauna have evolved in relative isolation since the island split from the African and Indian continents. When the last common ancestors of lemurs left Africa between 40 and 70 million years ago, they carried a subset of the viral diversity of the mainland population within them, which continued to evolve throughout the lemur radiation. Relative to other primate radiations, we know very little about the past or present viral diversity of lemurs, particularly mouse lemurs. Using high-throughput sequencing, we identified two gammaretroviruses and three betaretroviruses in the genomes of four species of wild mouse lemurs. The two gammaretroviruses and two betaretroviruses have not previously been described. One betaretrovirus was previously identified. All identified viruses are present in both Lorisiformes and Lemuriformes but absent from haplorrhine primates. The estimated ages of these viruses are consistent with the estimated divergence dates of the host lineages, suggesting they colonized the lemur genome after the Haplorrhine–Strepsirrhine split, but before the Lorisiformes–Lemuriformes split and before the colonization of Madagascar. The viral phylogenies connect multiple lineages of retroviruses from non-lemur and non-Madagascar-native species, suggesting substantial cross-species transmission occurred deep in the primate clade prior to its geographic dispersal. These phylogenies provide novel insights into known retroviral clades. They suggest that the origin of gammaretroviruses in rodents or bats may be premature and that the Jaagsiekte sheep virus clade may be older and more broadly distributed among mammals than previously thought.


Introduction
When the island of Madagascar broke away from Mainland Africa 165-121 mya (Rabinowitz, Coffin, and Falvey 1983;Roos, Schmitz, and Zischler 2004) and then from India ∼88 mya (Storey et al. 1995;Roos, Schmitz, and Zischler 2004), it created a natural laboratory (Yoder and Yang 2004) for understanding the processes of evolution. Since then, flora and fauna have evolved largely in isolation, punctuated by intermittent invasions from Africa and Asia. The first primates on Madagascar were one of those invasions; they arrived in Madagascar between 40 and 70 mya (Roos, Schmitz, and Zischler 2004;Yoder and Yang 2004;Poux et al. 2005;Perelman et al. 2011;Kistler et al. 2015;Herrera and Davalos 2016), possibly by rafting across the channel (Kappeler 2000;Roos, Schmitz, and Zischler 2004;Poux et al. 2005;Ali and Huber 2010;Nowack and Dausmann 2015;Dausmann and Warnecke 2016;Mazza, Buccianti, and Savorelli 2019;Masters et al. 2021). They subsequently radiated to fill the diverse niches that are typically occupied by monkeys and apes in Africa and Asia, making the lemur radiation on Madagascar a natural comparison with the radiations that occurred in other branches of the primate tree elsewhere.
When the last common ancestors of lemurs left Africa, they carried a subset of the viral diversity of the mainland population within them, which continued to evolve as their hosts underwent the lemur radiation as we know it today. Viruses generally evolve far more rapidly than their hosts, making current viral diversity frequently very different from past viral diversity (Gifford 2012;Greenwood et al. 2018). This discordance in evolutionary rates makes it difficult to reconstruct past viral diversity and to understand how current viruses evolved from those in the past.
Compared with other primate radiations (e.g. Catarrhines) (Switzer et al. 2005;Stengel et al. 2006;Liu et al. 2008;Locatelli and Peeters 2012;Jasinska et al. 2013), very little is known about either present or past viral diversity in lemurs. This reduces our insight into how primates and viruses have co-evolved and how it limits our ability to monitor and predict cross-species transmissions, including with humans. Large-scale studies of current, exogenous viruses in lemurs are rare but suggest that significant undescribed viral diversity is circulating among lemur populations (Zohdy et al. 2015). Even less is known about endogenous viral diversity in lemurs, but it has offered some surprising insights into the evolutionary history of some of the more extensively researched viral groups (Gifford 2012). For example, non-human primate simian immunodeficiency viruses (SIVs) (lentiviruses related to human immunodeficiency virus) were thought to occur only in Catarrhines until the recent discovery of endogenous prosimian SIV in mouse lemurs (Microcebus ssp.) and fat-tailed dwarf lemurs (Cheirogaleus medius) (Gifford et al. 2008;Gilbert et al. 2009;Rahm et al. 2011;Gifford 2012). An additional endogenized virus, a betaretrovirus, has been detected in unordered bacterial artificial sequences from the gray mouse lemur, possibly a result of having been infected by a rodent host (Baillie et al. 2004). The sequence identified contains frameshifts and premature termination mutations in the gag and pro open reading frames and deletions in the 3 ′ end of the pol gene and the 5 ′ end of the env gene. The pol gene also contains premature stop codons (Baillie et al. 2004).
We focused on the mouse lemurs (Microcebus spp.). Several lines of evidence suggest that mouse lemurs may be particularly informative for understanding viral evolution in lemurs, including as sentinel species (Zohdy et al. 2015), for detecting crossspecies spillover events. They are the smallest-bodied species of the lemurs (30-80 g), have small home ranges, subsist on an omnivorous diet, and are flexible in their use of torpor (Dausmann and Warnecke 2016;Radespiel 2016;Yoder et al. 2016;Ezran et al. 2017). This enables them to survive in diverse forests across Madagascar (Radespiel 2016;Yoder et al. 2016;Ezran et al. 2017), including in small, disturbed fragments or edge habitats that are unsustainable for larger species (Ganzhorn and Schmid 1998;Andriamandimbiarisoa et al. 2015;Setash et al. 2017;Knoop, Chikhi, and Salmona 2018;Schuessler et al. 2018;Steffens and Lehman 2018;Andriatsitohaina et al. 2019;Montero et al. 2019). As a result, they are likely to be sympatric with other wildlife in larger forest patches and with domestic animals and humans in more disturbed habitats, making them a potential link for cross-species transmissions (Zohdy et al. 2015). Moreover, because they are small, short-lived, rapidly reproducing mammals (Radespiel et al. 2019) capable of living at high densities (Setash et al. 2017), they may be able to endure higher viral loads due to reduced selection against tumors (for a review of tumors in prosimians, including mouse lemurs, see Remick, Van Wettere, and Williams 2009), similar to what has been hypothesized for why rodents and bats may be common links in cross-species retrovirus transmissions (Greenwood et al. 2018)).
Applying high-throughput sequencing (HTS) to these four species of mouse lemurs from wild populations sampled in northwestern Madagascar, we have identified multiple broadly distributed mouse lemur gamma-and betaretroviruses. They suggest substantial cross-species transmissions from multiple lineages of retroviruses from non-lemur and non-Madagascar-native species occurred prior to the primate colonization of Madagascar with subsequent host and viral evolution.

Sample collection
Four species of wild mouse lemurs ( Fig. 1) were trapped in Madagascar from August to September 2013 at four sites in the northwest (Table 1). All sites are dry deciduous forests, which experience a cool, dry season (May-October) and a hot, humid, rainy season (November-April). All sites are under heavy anthropogenic pressure. Jardin Botanique A and B (JBA and JBB) are in Ankarafantsika National Park. The forest in Mahatazana (16 ∘ 08 ′ S, 47 ∘ 41 ′ E) is near the Mahajamba River and borders rice fields. The Bombetoka site (15 ∘ 52 ′ S, 46 ∘ 15 ′ E) is approximately 20 km south of the city of Mahajanga at the western bank of the Betsiboka River. It is highly fragmented with clear signs of cattle-grazing and slash and burn clearing.
Lemurs were caught in Sherman Live traps, which were baited with banana, at dusk and collected at dawn using published procedures (Radespiel 2000;Radespiel et al. 2001). As mouse lemurs are seasonal breeders, females did not have dependent young at the time of sampling. Following veterinary protocols of the University of Veterinary Medicine Hannover, we carefully drew blood

Extraction and reverse transcription
RNA was extracted from blood using the protocol for purification of viral nucleic acids from plasma or serum (QiAamp Min Elute Virus Spin Kit, Qiagen, Germany). We modified the protocol by using 20 μl of proteinase K (Invitrogen, Germany) instead of the Qiagen protease that was provided, followed by the addition of 5 μl of linear acrylamide (5 mg/ml, Invitrogen, Germany). A DNase step was included in the AW1 buffer step. Samples were spun down using half of the recommended AW1 buffer, then added 10 μl ×10 TURBO DNase buffer (Ambion, Germany) and 1 μl of TURBO DNase (Ambion, Germany), and incubated it at 37 ∘ C for 15 min before applying the second half of the AW1 buffer and centrifuging. During the final elution step, samples were incubated for 5 min before centrifugation.
RNA was reverse transcribed into double-stranded cDNA in duplicate for each sample. Eleven microliters of each sample was mixed with 1 μl of Random Hexamers (50 μM, Invitrogen, Germany) and incubated at 65 ∘ C for 5 min and then for 2 min on ice. Four microliters of 5X Superscript III Buffer (Invitrogen, Germany) was combined with 1 μl of 10 mM dNTPs (New England BioLabs, Ipswich, USA), 1 μl of 0.1 M DTT, 1 μl of RNaseOUT Recombinant Ribonuclease Inhibitor (Invitrogen, Germany), and 1 μl Superscript III Reverse Transcriptase (Invitrogen, Germany). The samples were incubated at 25 ∘ C for 10 min, 50 ∘ C for 50 min, 85 ∘ C for 5 min, then put on ice for 2 min, incubated at 95 ∘ C for 2 min, and put on ice for 2 min. Second strand synthesis was done by adding 1 μl of DNA Polymerase I Klenow (5000 U/ml, New England Bio-Labs) and incubating at 37 ∘ C for 1 h and then at 75 ∘ C for 20 min. Double-stranded cDNA was quantified using the Qubit dsDNA HS assay (ThermoFisher Scientific, MA, USA) yielding concentrations ranging from 0.324 to 10.1 ng/μL.

DNA shearing, library building, and sequencing
Samples were sheared to a target length of 400 bp using a Covaris M220 Focused-Ultrasonicator in 50 or 15 μl tubes. When using a 50 μl tube, 20 μl of cDNA and 30 μl of water were used, following the manufacturer's instructions for obtaining a target length of 400 bp. When using a 15 μl tube, 15 μl of cDNA was used, and a treatment time of 37 s was applied. After shearing, samples were purified using a QIAquick PCR Purification Kit (Qiagen, Germany).
Double-stranded DNA samples were built into Illumina libraries using the following three steps: (1) blunt end reaction was performed using the NEBNext End Repair module (E6050; New England BioLabs) following the manufacturer's instructions. Blunt end samples were then purified using a QIAquick PCR Purification Kit and eluted at 32 μl.
(2) Illumina adapter ligation was performed on the repaired samples using NEBNext Quick Ligation module (E6056; New England BioLabs) following the manufacturer's instructions. Adapter-ligated products were purified using a QIAquick PCR Purification Kit. Reactions were eluted with 42 μl buffer EB and the column was incubated at 37 ∘ C for 5 min in a heat block before spinning down the DNA at 13,000 rpm for 1 min. (3) Adapter fill in was performed to remove nicks from the adapters using Bst Polymerase Kit (M0275S; New England BioLabs) as follows: mixing 40 μl of DNA eluted from Step 2, 5 μl of ThermoPol Reaction Buffer, 2 μl of 10 mM dNTPs (New England BioLabs), and 3 μl of Bst DNA polymerase, which was incubated in a thermal cycler for 20 min at 65 ∘ C and 20 min at 80 ∘ C. Adapter-ligated products were double indexed using the Herculase II Fusion DNA Polymerase (Agilent, Germany) in 50 μl reactions containing 10 μl Herculase Buffer, 0.5 μl dNTPs (25 mM, Agilent, Germany), 1 μl P5, 1 μl P7, 5-6.3 μl template library, 0.5 μl Herculase Polymerase, and water up to 50 μl. Amplification of the libraries was performed using the following cycling conditions: 95 ∘ C for 5 min, 5-23 cycles of 95 ∘ C for 30 s, 60 ∘ C for 30 s, and 72 ∘ C for 40 s, and a final 7 min elongation at 72 ∘ C. Illumina libraries were purified with a QIAquick PCR Purification Kit. Following purification, the forty-nine libraries were grouped into four pools of 10-15 libraries each. Size selection was performed on the pools for 250-600 bp fragments using a Pippin Prep Kit (Sage Science, Biozym Scientific in Olendorf, Germany). Final libraries were pooled equimolarly to a final library concentration of 4 nM for paired-end sequencing (2×300) on an Illumina MiSeq platform with the V3 reagent kit.

HTS data viral screening
Fastq files (SRA accession no. PRJNA779192) containing the raw data for each of the forty-nine samples that had sequencing output ranging from 200 to 400 Mb were scanned for adapter sequences and quality trimmed with Cutadapt (Marcel 2011), using a quality cut-off of 30 bp and a minimum length cut-off of 30 bp. Trimmed paired-end files for each sample were merged creating a single data file that was further processed with a modified version of the Virus Identification Pipeline (VIP). VIP (Li et al. 2016) is a bioinformatic pipeline that screens next generation sequencing data for viral hits. Each sample's fastq file was run through the pipeline using the sense mode. The pipeline first aligns the raw reads to the human genome (hg19) and filters out all reads that are mapped to the reference genome. The remaining reads are then filtered against a bacterial database to further minimize potential unspecific matches due to bacterial sequences. The curated and filtered reads were then processed and searched against a viral genome database, followed by an amino acid alignment to a viral protein database. Viral positive hits were then further processed using Geneious Prime 2020.0.3 de novo assembly to generate larger contigs.
All raw reads were also processed through the genome detective server to further validate and confirm viral positive hits. Genome detective (Vilsker et al. 2019) assembles viral genomes quickly and accurately using de novo alignments to generate viral contigs that have high combined amino acid and nucleotide scores. Both viral screening methods revealed contig matches to betaretroviral and gammaretroviral sequences from all samples.

Microcebus murinus genome retroviral screening and retroviral mining
Genome detective and VIP gamma and betaretroviral-positive contigs with an average size of 800 bp and read depth ranging from ×10 to ×300 were blast searched against the M. murinus genome v3.0 ( Supplementary Fig. S1). Positive hit coordinates were extracted from the genomic sequence ±10,000 bp. The extracted genomic sequences were annotated using Geneious Prime and aligned using MAFFT algorithm with default settings (Katoh, Rozewicki, and Yamada 2019). Alignments were used to generate majority rule consensus sequences for the proviruses and LTRs for the corresponding retroviral hits (Supplementary fasta files). Retrotector (Sperber et al. 2007) was employed to generate putative, reconstructed protein, called puteins, for all retroviral genes on the proviral consensus. Conserved motifs were further confirmed using NCBI conserved domain database (CDD) (Marchler-Bauer et al. 2011).

Recombination screening
Extracted scaffold proviral alignments generated were screened using the Recombination Analysis Tool (RAT) with the default settings for potential recombination events (Etherington, Dicks, and Roberts 2004).

Genome assemblies retroviral screening
Genome assemblies obtained from NCBI were used to determine the presence of the identified betaretroviral and gammaretroviral sequences in species other than M. murinus, M. bongolavensis, M. ravelobensis, and M. myoxinus.

Phylogenetic analysis
Multiple alignments were generated using MAFFT v 7 with an iterative refinement method for both betaretroviral and gammaretroviral nucleotide and amino acid sequences (Katoh, Rozewicki, and Yamada 2019). Further curation was performed on both alignments as needed. Additional betaretroviral and gammaretroviral sequences were obtained from GenBank with the following accession numbers: reticuloendotheliosis virus ( . Statistical selection of best fit models for the whole genome alignments and gene-specific protein alignments was performed using jModelTest2 (Darriba et al. 2012) and ProtTest3 (Darriba et al. 2011). Phylogenetic analysis of the nucleotide and protein alignments was performed with the RAxML bootstrap algorithm with 500 bootstrap replicates using the specified best fit models as determined by jmodelTest2 and ProtTest3 (Huelsenbeck and Ronquist 2001;Stamatakis 2014).

Age estimation of retroviral sequences
To estimate the identified retroviruses integration ages into the mouse lemur genome, three methods employing a molecular clock approach were used. First, the divergence of LTR sequences for each retrovirus from its consensus LTR sequence was estimated, while excluding the hypermutable CpG sites from the analysis (Mayer et al. 2013). The sequence divergence was corrected based on the Kimura-based parameter (K2P) model (Kimura 1980). The calculated sequence divergence from the consensus and the reported mouse lemur mutation rate of 0.0017/nt/myr were used to estimate the evolutionary rates of the identified retroviral sequences (Campbell et al. 2021). As a second method of age estimation, we used the sequence divergence of each aligned proviral gene from its consensus sequence, again excluding the hypermutable CpG sites from the analysis. Divergence estimates were corrected using the K2P model as earlier, and using the mouse lemur reported mutation rate, to estimate the age of each provirus. Proviral 5 ′ LTR and 3 ′ LTR sequence divergence was used as a third way to estimate the age of the endogenous elements. 5 ′ LTR and 3 ′ LTR sequences upon integration into the genome are identical, and they acquire mutations independently through time based on the host mutation rate. For each proviral sequence identified, its age was estimated employing T = D/(2*0.0017) where D is the calculated divergence between the two sequences and 0.0017 is the host mutation rate.

ERV identification in mouse lemur sequencing data
Viral screening of the cDNA sequences of all samples using the VIP and genome detective revealed the presence of multiple sequences that were homologous to the Orthoretroviridae gamma and betaretroviruses. De novo assembly of sequences was performed in all samples resulting in larger contig sequences revealing the presence of two gammaretroviruses, which we designated MicrocebusERV-1-1 and MicrocebusERV-1-2, and three betaretroviruses, which we designate MicrocebusERV-2-1, MicrocebusERV-2-2, MicrocebusERV-2-3 (previously identified by Baillie et al. 2004).

Gammaretroviruses
A blastn search of MicrocebusERV-1-1 showed an identity of 76 per cent to the Baboon ERV and MicrocebusERV-1-2 showed an identity of 75 per cent to the GaLV-brain.

Betaretroviruses
MicrocebusERV-2-1 showed an identity of 64 per cent to the endogenous virus Jaagsiekte sheep retrovirus (en-JSRV-14), MicrocebusERV-2-2 showed an identity of 67 per cent to the Nannospalax gallii ERV, MicrocebusERV-2-3 showed an identity of 66 per cent to the TvERV and an 97.1 per cent identity to AC145758, and the endogenous Microcebus murinus betaretrovirus previously identified in a bacterial artificial chromosome (BAC) sequence (Baillie et al. 2004) at the DNA level. Larger contigs for all five retroviruses were present in the majority of mouse lemur samples from all four species examined.
Consensus contig sequences for each retroviral genome identified were used as query sequences to perform a blastn search to the genome of M. murinus v3 (GCA_000165445).
Blastn searches revealed 1,446 hits for MicrocebusERV-1-1, 2,010 hits for MicrocebusERV-1-2, 457 hits for MicrocebusERV-2-1, 132 hits for MicrocebusERV-2-2, and 929 hits for MicrocebusERV-2-3 ( Table 2). Coordinates of the blastn hits were used to extract the corresponding regions from the M. murinus v3 (GCA_000165445) chromosomes or scaffold sequences, including 10,000 bp both upstream and downstream of the blastn identified region. Using approximately 10,000 bp was sufficient to obtain full length ERV genomes. For the gammaretroviruses, despite the large number of blastn hits, the analysis revealed the presence of only 20 proviral copies of MicrocebusERV-1-1 and 109 proviral copies of MicrocebusERV-1-2. For the betaretroviruses, the search found five proviral copies of MicrocebusERV-2-1, thirty-nine proviral copies of MicrocebusERV-2-2, and 138 proviral copies of MicrocebusERV-2-3 ( Table 2). The exact boundaries for each ERV locus were identified using repeatmasker using published methods (Tsangaras et al. 2015) (Supplementary fasta files). Extracted loci were further screened using the CENSOR software tool of repbase (Kohany et al. 2006). The CENSOR script compares the sequences provided with a reference database of repetitive elements and masks the homologous regions of the query sequence; at the same time, a report is generated indicating the classification of the matched repeats. Analysis of the extracted loci using CENSOR (Kohany et al. 2006) matched the sequences with both gamma and betaretroviral elements from the database verifying the initial assessment.
The exact boundaries for each provirus were annotated and aligned using MAFFT, and recombination analysis using RAT was performed in the multiple sequence alignments of the extracted proviruses identifying a recombinant MicrocebusERV (recMicro-cebusERV)-1-1 with MicrocebusERV-2-1 proviral sequences from 5,115 to 6,725 bp. The three recombinant sequences were analyzed separately.
Multiple sequence alignments of extracted proviral sequences were also used to generate majority rule consensus sequences for each provirus. The NCBI CDD (Lu et al. 2020) and retrotector (Sperber et al. 2009) were used as independent verification methods to identify the presence of retroviral motifs and genes in the consensus sequences. Retrotector analysis of the gammaretroviral majority rule consensus proviral sequence revealed gag, protease For each provirus, we indicate the contig blastn hits, the number of proviruses, the number of soloLTR sequences identified, and the consensus sequence length. Age estimations of each MicrocebusERV are also shown. N/D indicates not determined. Figure 2. Genomic structure of consensus sequence of identified ERVs. For each ERV consensus sequence, retrotector analysis identified LTRs that are illustrated with light blue, while retrotector identified proviral genes (Gag, Pro, Pol, and Env) that are illustrated in dark blue for the gammaretroviral sequences and in dark green for the betaretroviral sequences. Motifs identified using NCBI CDD for each ERV consensus sequence are illustrated in pink.
(pro), pol, and env genes in both proviral consensus sequences (Sperber et al. 2009). Retrotector analysis of the betaretroviral majority rule consensus proviral analysis was able to identify gag, protease (pro), and pol genes in all the betaretroviruses and env gene in MicrocebusERV-2-1 and MicrocebusERV 2-3 (Sperber et al. 2009). Retrotector analysis of the consensus recMicrocebusERV identified a gammaretroviral pol gene and betaretroviral env gene. A dUTPase domain was identified in all the betaretroviral proviral sequences but, as expected, was not identified in the gammaretroviral proviral sequences. The retrotector script was also able to generate reconstructed retroviral protein (puteins) sequences for the consensus sequence of each provirus (Fig. 2). A CDD analysis of the majority rule consensus sequences generated from the alignment also identified the same motifs as retrotector for all five proviral sequences and the recombinant sequence. The combined results indicate a high level of conservation of retroviral motifs among the identified proviruses and retroviral proteins.
All the viruses are expressed, as demonstrated by their retrieval from cDNA, and the majority of the generated contigs map to internal parts of the LTRs and genes. While we did not detect hard start and hard stop regions at the start of env and the end of pol, this may be due to low or inconsistent coverage.
The potential open reading frames in the proviruses suggest that multiple different products could be generated. Supplementary Table S1 shows the proviruses that could produce retroviral proteins.
Further analysis of the proviral sequences identified tRNA Pro for both gammaretroviral proviral sequences, tRNA Ile for MicrocebusERV-2-1, tRNA Phe for MicrocebusERV-2-2, and tRNA Lys for MicrocebusERV-2-3 as the tRNAs that would recognize the primer binding site (PBS) for each virus. A PBS was not identified for recMicrocebusERV-1-1. Analysis of the longest and most complete M. murinus genome hits of the identified proviruses revealed that the majority of genes are highly mutated and are unlikely to produce fully functional retroviral proteins. However, Supplementary Table S1 shows the proviruses that could potentially produce one or more fully functional retroviral proteins.

MicrocebusERV age estimation
Age estimation of the identified MicrocebusERV sequences was performed using three different approaches. First, we used a molecular clock approach using the LTR sequences. The identified LTRs for each ERV were used as a template for a blastn search. The results of each search were multiple aligned using Values are the mean percentage identities from the assemblies screened in each group.
MAFFT (Katoh et al. 2005) and manually curated. The resulting curated alignments were used to obtain the Kimura-2-parameter distances of the identified LTR sequences to each LTR majority rule consensus sequence, excluding from the alignment of the CpG dinucleotides positions due to 5-methyl cytosine spontaneous deamination that leads to higher mutation rates in that locations (Kimura 1980). Using the previously published lemur mutation rate of 0.0017/nt/year (Campbell et al. 2021), we estimate the age of each provirus (Table 2). A second dating method used the proviral gene sequences. Each proviral gene consensus sequence was blastn searched in the M. murinus genome. Resulting hits were aligned using MAFFT (Katoh et al. 2005) and manually curated. The resulting curated multiple sequence alignments were used to obtain the Kimura-2-parameter distances of the identified gene sequences to each gene majority rule consensus sequence. As with the LTR dating approach, the CpG dinucleotide positions were excluded from the analysis. The age for this approach was calculated using the lemur mutation rate indicated previously. The estimated average age using gene sequences for identified proviruses was calculated ( Table 2). The third dating method used the proviral 5 ′ LTR compared to 3 ′ LTR sequences. Retroviruses, upon integration into the host genome, have identical 5 ′ and 3 ′ LTR sequences that acquire mutations independently through time based on the mutation rate of the host. Using the divergence of the two sequences and the M. murinus mutation rate, the proviral average age was calculated for each provirus (Table 2).

Molecular screening of other primates and endemic Madagascar species genomes
Majority rule consensus sequences of each provirus were used for screening a number of primate and Madagascar endemic species genome assemblies obtained from NCBI. Strepsirrhini are estimated to have split from the Haplorrhini ∼87 million years ago (Perelman et al. 2011). Age estimations of the identified proviruses indicate that they should be present in the Strepsirrhini species but not in the Haplorrhini species. A blastn search of the Haplorrhine species failed to produce positive identification for the proviral majority rule consensus sequences. Screening of several Strepsirrhini species genome assemblies revealed positive hits for all five proviral sequences as expected based on the age estimation of the proviral sequences. This included hits in both the Lorisiformes (O. garnettii and N. coucang) and the Lemuriformes. Table 3 reports the per cent identity for each identified ERV with the Microcebus species, the Lemuriformes, and the Lorisiformes, respectively. A Blastn search was also performed on genome assemblies of Talazac's shrew tenrec (M. talazaci) and the fossa (C. ferox), two other Madagascar endemic species, to examine if the ERVs identified can be found in other isolated species on the island. A gammaretrovirus majority consensus sequences blastn search did not produce positive results with either of the genome assemblies, while the three betaretroviruses yielded positive matches with 70 per cent pairwise identity for M. talazaci. Betaretroviruses screening of the C. ferox genome assemblies produced a positive hit with 70 per cent pairwise identity in the gag gene of MicrocebusERV-2-3. Phylogenetic analysis of the extracted hits from C. ferox and M. talazaci place them in the same clades as the identified ERVs.

Phylogenetic analysis of retroviral genomes
Retroviral sequences with nucleotide percentage identity >25 per cent to identified mouse lemur ERVs were extracted and aligned using MAFFT. Maximum likelihood phylogenetic analysis was performed on the resulting alignment using SIV as an outgroup. The resulting phylogeny clearly separated the two retroviral genera beta and gamma into two separate nodes (Fig. 3). In the gammaretroviral node, MicrocebusERV-1-2 was placed in the same clade as the RfRV, while the MicrocebusERV-1-1 and recMicrocebusERV-1-1 proviruses were in the same clade as UrsusERV (Tsangaras et al. 2015) (Fig. 3).
In the betaretroviral node, MicrocebusERV-2-1 was placed in the same clade as JSRV sequences, MicrocebusERV-2-2 was in the same clade as Mouse Mammary Tumor virus and HERV-K ERVs, while MicrocebusERV-2-3 clusters with the TvERV (Baillie and Wilkins 2001) (Fig. 3). A maximum likelihood phylogeny of the putein sequence generated from retrotector was also performed. Results of the protein phylogeny were almost identical to the nucleotide sequence analysis, though some clades in the proteinbased tree had lower confidence most likely due to insufficient informative sites ( Supplementary Figs S2 and S3).

Discussion
Five ERVs in the Microcebus genome, two gammatretroviruses and three betaretroviruses, were identified. All five were found to be present in the Strepsirrhines, including both Lorisiformes and Lemuriformes, but absent from the Haplorrhines, including the tarsier. This suggests that they originated after the Haplorrhine-Strepsirrhine split approximately 87 mya (95 per cent highest posterior density [HPD]: 76-99) but before the Strepsirrhini split into the Lorisiformes and Lemuriformes approximately 69 mya (95 per cent HPD: 59-77) (Perelman et al. 2011).
Moreover, these estimated viral ages are also consistent with evidence that Madagascar was colonized by ancestral lemurs 77-39 mya (Perelman et al. 2011). As would be expected for a virus that originated before the lorisiform-lemuriform split, none of these viruses are dated exclusively after the date range of the colonization. Together, this evidence presents a consistent scenario in which these viruses likely originated in ancestral Strepsirrhines, after the divergence of Haplorrhines, but before the divergence Figure 3. Maximum likelihood phylogenetic analysis of gammaretroviral and betaretroviral majority rule consensus MicrocebusERV sequences compared with NCBI extracted sequences. SIV was used as an outgroup. MicrocebusERV gamma proviral sequences identified are highlighted in blue, while MicrocebusERV beta proviral sequences are in green. MicrocebusERV-1-2 is clustered in the same clade as RfRV and REV, while MicrocebusERV-1-1, recMicrocebus-1-1, and UrsusERV are clustered in the same clade. MicrocebusERV-2-1 lies in the same clade as JSRV sequences, MicrocebusERV-2-2 is in the same clade as Mouse Mammary Tumor virus and HERV-K ERVs, while MicrocebusERV-2-3 is located in the same clade as TvERV.
of Lemuriformes and Lorisiformes, i.e. before the colonization of Madagascar.
The betaretroviruses also produced hits in other phylogenetically distant species that are sympatric with lemurs: fossa (C. ferox) and tenrecs (M. talazaci). The fossa genome produced hits with MicrocebusERV-2-3 and the tenrec genome produced hits with all three betaretroviruses. Carnivora (which includes the fossa) are estimated to have colonized Madagascar between 26 and 19 mya (33-14 mya) and tenrecs between 42 and 25 mya (50-20 mya) (Poux et al. 2005). Because the tenrec confidence intervals overlap with those of both the Carnivora and lemur colonization events, it is possible that tenrecs may have arrived simultaneously, via rafting, with either carnivores or lemurs (Poux et al. 2005). Alternatively, tenrecs may have arrived via a separate rafting event or a land bridge hypothesized to have existed between 45 and 26 mya (Poux et al. 2005;Masters et al. 2021). These colonization dates are broadly consistent with the estimated ages of the betaretroviruses, suggesting that transmission events between the lemurs and the other two taxa may have been possible. One possible scenario is that MicrocebusERV-2-3 was transmitted in Madagascar between lemurs, carnivores that preyed on them, and tenrecs, which were also likely prey. Alternative scenarios include that (1) tenrecs and carnivores may have been independently infected from a common source either before or after colonizing Madagascar, or (2) the primates, fossa, and tenrecs may be carrying closely related but phylogenetically distinct ERVs. For MicrocebusERV-2-1 and Microcebus-2-2, homologs of which were found in the tenrecs but not the fossa, it is not yet clear if and how these species might be linked. Possibilities include that another species may have interacted with both or that the primates and tenrecs may be carrying closely related but phylogenetically distinct ERVs.
It is likely that the proviruses we identified are not intact enough to replicate. While some of the proviruses could potentially produce one or more fully functional proteins (Supplementary Table S1), the majority of genes we detected contained mutations. Although we did not detect env in MicrocebusERV-2-2 (Fig. 2), and the loss of env functionality can be associated with a switch from transmitting via infection to transmitting via other mechanisms, like retrotransposition (Magiorkinis et al. 2012), we expect that the mutations throughout the other genes render these proviruses incapable of any current replication. Of the two MicrocebusERV-2-2 proviral sequences identified in Supplementary Table S1 as having the potential to code for proteins, one has the potential to code for a gag protein and the other for a pol protein. Whether the copy numbers observed in our data (ranging from five for MicrocebusERV-2-1 to 138 for MicrocebusERV-2-3) are all the result of reinfections or whether some are due to retrotransposing after losing env functionality, but before the subsequent degradation of the other genes, is difficult to determine.
The recombinant virus, recMicrocebus ERV-1-1, contains a gammareretroviral pol gene from MicrocebusERV-1-1 and a betaretroviral env gene from MicrocebusERV-2-1. This differs from the well-documented pattern in which betaretroviruses acquire gammaretroviral env genes, a pattern that has been argued to enable betaretroviruses to expand the range of hosts that they can infect (Henzy and Coffin 2013;Henzy and Johnson 2013). To our knowledge, the evolutionary implications of acquiring a betaretroviral env are less well understood and possibly less frequent; however, they do indicate a complex history of retroviral cross-genera recombination.

The gammaretroviruses
Gammaretroviruses are estimated to have originated early in mammalian evolution and have demonstrated a pattern of frequent intra-order host switching (Hayward, Grabherr, and Jern 2013;Hayward, Cornwallis, and Jern 2015). This pattern appears to also be evident in Strepsirrhines, in that the endogenous gammaretroviruses we detected did not form a clade of their own or with the endogenous gammaretroviruses found in other primates.

Microcebus ERV-1-1
MicrocebusERV-1-1 provirus was placed in the same clade with UrsusERV and is estimated to be 60 (±28) million years old. UrsusERV age estimations indicate that the provirus is relatively young and believed to have been circulating in the bear population approximately 10 mya (Tsangaras et al. 2015). Given the large time gap between the origins of MicrocebusERV-1-1 in strepsirrhine primates and the origins of the UmaERV in bears, it is likely that these represent independent germline colonization events by related exogenous retroviruses.

Microcebus ERV-1-2
MicrocebusERV-1-2 was dated to about 59 (±58) million years old. It is in the same clade as the RfRV and CrERV. RfRV is basal to most mammalian gammaretroviruses (Cui et al. 2012;Cui, Tachedjian, and Wang 2015). It was detected in the greater horseshoe bat (R. ferrumequinum) (Cui et al. 2012), but is thought to have originated in tree shrews (Cui, Tachedjian, and Wang 2015). The similarity of MicrocebusERV-1-2 to RfRV supports the idea that there has been viral transmission among bats, rodents, and primates by related exogenous retroviruses since the beginning of the primate radiation. The basal position of the mouse lemur virus suggests that a hypothesized bat or rodent origin for gammaretroviruses may be premature (Cui et al. 2012;Cui, Tachedjian, and Wang 2015).
CrERV is a gammaretrovirus, which is part of a clade that is found in cervids, including mule deer (Odocoileus hemionus), whitetailed deer (Odocoileus virginianus), elk (Cervus canadiensis), and muntjac (Muntiacus muntiak) (Elleder et al. 2012). The presence of this clade of viruses in numerous cervids suggests that the original germline colonization may have occurred before the cervid lineages split approximately 10 mya (Gilbert, Ropiquet, and Hassanin 2006;Elleder et al. 2012). This clade is more distantly related to the endogenous gammaretroviruses of pigs and sheep (Elleder et al. 2012). CrERV, specifically, has been documented to have invaded the mule deer genome when the mule deer and whitetailed deer lineages separated about 1 mya (Elleder et al. 2012). The old age of MicrocebusERV-1-2 and the clustering of viruses from distantly related clades found in distantly related hosts suggest complicated viral transmission patterns that are not yet fully resolved.

Betaretroviruses
Betaretroviruses are highly abundant in vertebrates, showing strong patterns of intra-class host switching (Hayward, Cornwallis, and Jern 2015). The three endogenous betaretroviruses in the Strepsirrhines are consistent with that pattern in that they do not form a clade together, or with other primates, and suggest that there may have been cross-species transmissions involving fossa and tenrecs. However, we do acknowledge that it is also possible that the betaretroviruses in the Strepsirrhines, fossa, and tenrecs could be closely related, but phylogenetically distinct, viruses.

Microcebus ERV-2-1
MicrocebusERV-2-1 is estimated to be 44 (±48) million years old and in the same clade as JSRV. This represents the first nonbovid virus in this viral clade with a 64 per cent identity. Some of the proviral sequences of ERV-2-1 produced a hit for orf-x, an accessory gene with unknown function that overlaps with pol, is conserved and thought to be unique to JSRV (Rosati et al. 2000;Griffiths, Martineau, and Cousens 2010;Hofacre and Fan 2010). JSRV has a complex and dynamic evolutionary history and, to our knowledge, is the only known aerosol-transmitted retrovirus. While viruses in the JSRV clade are common in Bovids, older work using Southern blot hybridization suggests that related viruses are also present in other taxa, including some carnivores and primates (Hecht et al. 1996). It is first integrated into the genome of Ovis and Capra prior to their divergence 5-11 mya (Armezzani et al. 2014;Cumer, Pompanon, and Boyer 2019). Some of the endogenized copies are associated with placental development and may be protective against closely related pathogenic JSRV, making the sheep-JSRV model useful for understanding co-evolutionary dynamics (Armezzani et al. 2014;Cumer, Pompanon, and Boyer 2019). The presence of this virus in Lorisiformes as well as Lemuriformes and its basal position to the JSRVs suggest that the integration time in lemurs is likely to be well before the integration into the germlines of Ovis and Capra and that this viral group may be more widespread and diverse among wildlife than previously thought.

Microcebus ERV-2-2
MicrocebusERV-2-2 was dated at 44 (±39) million years old and was placed in the same clade as Mouse Mammary Tumor virus and HERV-K ERVs. These groups are associated with cancers ( (Kim et al. 1999;Lavie et al. 2004). Given the frequency of host switching in betaretroviruses (Hayward, Cornwallis, and Jern 2015), it is not possible to determine whether these germline colonization events, or the colonization of the tenrec germline, have a single origin.

Microcebus ERV-2-3
MicrocebusERV-2-3 was estimated to be 60 (±56) million years old. It is basal to the other four identified retroviral clades (Fig. 3). As ERV-2-3 is 97.1 per cent similar to AC145758, it is very likely to be the same endogenous M. murinus betaretrovirus previously identified in a BAC sequence (Baillie et al. 2004). The virus is basal in the phylogenetic tree and appears to be in the same clade as TvERV, an endogenous type D betaretrovirus found in the common brushtail opossum, T. vulpecula, and retrovirus elements in the mouse and rat genomes. MicrocebusERV-2-3 and TvERV also formed a clade with the SRV ((Baillie and Wilkins 2001) and Fig. 3), with SMRV-H and DrERV (from vampire bats in Mexico). Thus, this clade includes viruses infecting hosts which are both phylogenetically and geographically distant, suggesting a complex and still largely unknown history of cross-species transmission.

Conclusion
This study indicates that Strepsirrhines have been evolving and experiencing retroviral genome colonization events since before the lorisiform-lemuriform split. That the mouse lemur viruses largely cluster with viruses in non-lemur, and indeed, nonprimate hosts, suggests that patterns of frequent host switching by gamma-and betaretroviruses have also occurred in these lineages. Much of the diversification of ERVs in mouse lemurs took place before the colonization of Madagascar, though further infection of other species in Madagascar, such as tenrecs and fossas, could potentially have occurred subsequently. Some of the discovered phylogenetic associations suggest interspecies transmission is both temporally and spatially complex, e.g. the squirrel monkey-and vampire bat-like MicrocebusERV-2-3, or suggests the origins of known clades may be far older than previously estimated, e.g. the JSRV-like MicrocebusERV-2-1. Further sequencing of wildlife will likely help identify and resolve the associations of many additional retroviral clades.

Supplementary data
Supplementary data are available at Virus Evolution online.